1. INTRODUCTION

The widespread deployment of surveillance camera systems in public areas in recent years has increased
the demand for systems that can automatically analyze video surveillance streams in real time.
Automatically detecting abnormal events in complicated and crowded scenes is a challenging task in
intelligent video surveillance, and it has attracted significant computer vision research interest in
recent years. In this work, we present and evaluate anomaly detection approaches, with a focus on deep
learning-based methods, for automatically detecting and localizing anomalous events, a field in which
knowledge is continuously evolving. This section introduces the research topic, the background
information, and the research objectives, and finally outlines the structure of the paper.

Anomaly detection in video is the task of recognizing frames in a video sequence that reflect
occurrences differing significantly from the normal. Identifying unusual incidents, such as fires, car
accidents, escapes, stampedes, or fighting, can be quite useful [1], [2]. The detection and localization of
anomalies is one of the most difficult tasks in video processing because the definition of “anomaly”
can be ambiguous and depends on context. Visual behaviors are complicated and diverse in an
unrestricted world, and complicated backgrounds, moving cameras, occlusion, shadows, and lighting are
challenges to overcome. In general, an occurrence is regarded as an "anomaly" if it occurs infrequently or
unexpectedly [3], [4].

Anomaly detection is a growing field of research in its own right. Although various methods have
been proposed to address this problem, they all have limitations. In particular, the availability of a
labeled dataset containing a collection of normal events is a requirement for the majority of approaches
currently in use [1]. This assumption restricts their field of use because it prevents the system from
being continuously retrained without human intervention.

Various approaches have been proposed. Early literature relies on trajectory-based techniques
[5], [6]. These techniques extract the targets' trajectories by visual tracking, and a model
is learned to describe normal actions. An anomaly is then defined as an activity whose trajectory
differs significantly from the learned model. However, these techniques are ineffective for complex and
crowded scenes because of their high time complexity and the occlusion caused by moving objects
[7]. Therefore, more recently, non-object-centered unsupervised approaches have become more common.
These approaches tackle the problem of anomaly identification by learning representative activity patterns 
from the behavior-related characteristics of objects and humans in spatial and temporal contexts. Size, 
gradient, speed, and direction of the targets in the image are typically taken into account as behavioral 
attributes and are expressed with low-level representations like 3D spatio-temporal gradient, histogram of 
optical flow (HOF), histogram of oriented gradients (HOG) [8], and dense spatial-temporal interest points 
(dense STIPs). These methods have an advantage over trajectory-based methods in that they work at the pixel 
level, which makes them more robust in complicated scenes [7]. 

Dictionary learning is another approach proposed for anomalous event detection; it
builds a dictionary of typical events and labels as abnormal the events that the dictionary cannot
adequately represent. Low-level features like 3D gradient features and HOF or HOG features may also be
subject to dictionary learning [1]. However, all of these methods depend on hand-crafted features that are
difficult to define a priori because there are so many different types of anomalous behaviors. In addition,
they are unable to adapt to abnormalities that have never been encountered before [7].

Recently, a variety of computer vision tasks have been successfully tackled using deep learning
approaches, surpassing the state of the art in a variety of difficult problems such as object classification
[9]-[11], object detection [12], [13], and action recognition [8], [14], [15]. Deep learning is a subtype of
machine learning that achieves high performance by learning to represent the information as a hierarchy of
nested concepts within layers of the neural network [16]. As the volume of data increases, deep learning
outperforms classical machine learning, as illustrated in Figure 1.

Figure 1. Performance of deep learning-based algorithms in comparison to traditional algorithms [16]


Deep learning-based methods for anomaly detection rely on one of the following techniques: the
reconstruction error, which measures the divergence of the test data from a set of normal training videos;
future frame prediction; classifiers; or scoring methods. Most of these techniques, and the
“traditional” approaches in particular, presuppose the existence of a labeled dataset that represents a
collection of ‘normal’ events. In this work, we present a variety of contributions that tackle these issues,
focusing on deep learning-based methods. Today, these deep learning-based approaches are
evolving rapidly and constantly, which makes this area of expertise particularly difficult to master. Unlike
previous review papers, which are general and address the anomaly detection problem in many fields, our
paper focuses specifically on anomaly detection in the video surveillance context using deep learning
approaches, and it covers the problem from different sides: techniques, datasets used, and metrics.

This paper is organized as follows: the first section serves as an introduction. In the second section,
we review the deep learning-based methods for anomaly detection in surveillance video. In the third section,
we present the publicly available datasets, and in the fourth section, we describe the most used evaluation
metrics for evaluating and comparing the methods. In the fifth section, we compare and discuss the results
of different approaches on several datasets. Finally, we end the paper with a conclusion.


2. DEEP LEARNING-BASED METHODS FOR ANOMALY DETECTION 

Deep learning algorithms have proven effective in a variety of computer vision tasks, such as object 
classification [9], [17], object detection [12], [18], and action recognition [19], [20], including anomaly 
detection in video surveillance. As already introduced in the previous section, the approaches that have been 
proposed to tackle this challenge can be grouped into four categories: reconstruction error, future frame 
prediction, classifiers, and scoring. 


2.1. Reconstruction error based methods 

The reconstruction error is one of the most used approaches for solving the anomaly detection problem. The
basic assumption is that the reconstruction error will be smaller for normal samples, because they are
closer to the training data, and higher for abnormal samples [21]. Deep learning-based methods
typically train a deep neural network as an auto-encoder (AE) and use it to reconstruct normal events with
a small reconstruction error. However, as claimed in [22], larger reconstruction errors for anomalous events do not
necessarily occur. As a result, in practice, many methods based on reconstructing the training data
cannot guarantee the detection of abnormal events.
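
The reconstruction-error principle can be illustrated with a minimal sketch. It does not reproduce any of the reviewed methods: the "model" here is simply the per-pixel mean of the normal training frames, used as a stand-in for a trained auto-encoder, and the threshold rule (largest error seen on the training frames) is an assumption made for illustration. Frames are represented as flat lists of pixel intensities.

```python
from statistics import fmean

def fit_mean_frame(normal_frames):
    """Stand-in for AE training: the 'model' is the per-pixel mean of normal frames."""
    n_pixels = len(normal_frames[0])
    return [fmean(f[i] for f in normal_frames) for i in range(n_pixels)]

def reconstruction_error(frame, reconstruction):
    """Mean squared error between a frame and its reconstruction."""
    return fmean((a - b) ** 2 for a, b in zip(frame, reconstruction))

def detect(frames, model, threshold):
    """Flag a frame as anomalous when its reconstruction error exceeds the threshold."""
    return [reconstruction_error(f, model) > threshold for f in frames]

# Toy data: three "normal" training frames with four pixels each (hypothetical values).
normal = [[0.1, 0.2, 0.1, 0.2], [0.2, 0.1, 0.2, 0.1], [0.15, 0.15, 0.15, 0.15]]
model = fit_mean_frame(normal)

# Assumed threshold rule: the largest error observed on the normal training frames.
threshold = max(reconstruction_error(f, model) for f in normal)

test_frames = [[0.12, 0.18, 0.14, 0.16],  # close to the training data
               [0.9, 0.8, 0.1, 0.2]]      # far from the training data
flags = detect(test_frames, model, threshold)
```

With a real auto-encoder, only `fit_mean_frame` and the way the reconstruction is produced would change; the scoring and thresholding steps stay the same. The caveat raised in [22] still applies: a high-capacity model may also reconstruct anomalous frames well.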

A method was proposed in [23] to learn normal patterns with minimal supervision using
autoencoders. First, the authors use conventional hand-crafted spatio-temporal local features to train an
autoencoder. The value of this type of information for training is that it can be used with little or no
supervision. Then, they develop a fully convolutional AE to learn the classifiers and the local
features in one framework.

Another method was proposed in [24], where the authors used generative adversarial networks
(GANs) [25], which take normal frames and the associated optical-flow images as training data to learn the
normal frame representation. The GANs cannot generate abnormal events because they have only been
trained on normal data. Therefore, to detect abnormalities, a local difference between the actual and
generated images is computed at testing time. In future work, it could be possible to use dynamic images [26]
to represent motion data.

Similarly, the work of [27] also uses GANs [28] and applies transfer learning with a
pre-trained CNN (VGG16). Transfer learning is a vital machine learning technique for addressing the
fundamental issue of insufficient training data. Its goal is to transfer knowledge from one domain to another
[16]. The authors also improve the model's effectiveness by processing the video's optical-flow information. The
experiments are run on the University of California, San Diego (UCSD) datasets, and the evaluation
uses criteria such as the area under the receiver operating characteristic (ROC) curve (AUC) and the
equal error rate (EER).

Vu et al. [28] propose a two-fold approach. On one side, they propose a customizable multi-channel
framework for generating multi-type frame-level features; on the other side, they
investigate how supervised learning can be used to increase detection performance. The multi-channel
framework is composed of four conditional GANs (CGANs) [29] that take various types of
motion and appearance data as input and produce prediction data as output. The peak signal-to-noise ratio
(PSNR) is then used to encode the difference between the generated and ground-truth information. For
frame-level anomaly detection, a binary support vector machine (SVM) is used. Finally, they perform
object-centric anomaly localization by using a mask region-based convolutional neural network (R-CNN) as a
detector. They evaluate their solution on four different datasets: Avenue, ShanghaiTech, and the two UCSD subsets.
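
As a brief illustration of the PSNR measure mentioned above, the following sketch computes it between a predicted frame and its ground truth using the standard definition, 10 log10(MAX^2 / MSE). It is not the implementation of [28]; frames are assumed to be flat lists of intensities in [0, 1], and the sample values are hypothetical.

```python
import math

def psnr(predicted, ground_truth, max_value=1.0):
    """Peak signal-to-noise ratio (in dB) between a predicted frame and its ground truth."""
    mse = sum((p - g) ** 2 for p, g in zip(predicted, ground_truth)) / len(ground_truth)
    if mse == 0:
        return math.inf  # identical frames
    return 10.0 * math.log10(max_value ** 2 / mse)

# A prediction close to the ground truth yields a high PSNR (expected for normal events),
# while a poor prediction yields a low PSNR (a hint that the frame may be abnormal).
good = psnr([0.5, 0.5, 0.5, 0.5], [0.5, 0.6, 0.5, 0.4])
bad = psnr([0.5, 0.5, 0.5, 0.5], [0.9, 0.1, 0.9, 0.1])
```

A lower PSNR means a larger gap between the generated and the real frame, which is why it can serve as a frame-level feature for a downstream classifier.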

Sabokrou et al. [21] propose an approach for anomaly detection and localization based on two
cubic-patch stages, where one relies on the ability of an autoencoder to reconstruct an input video patch, while
the other relies on the sparse representation of an input video patch. These two stages are
built on the analysis of the reconstruction error of the AE and of the sparsity value (SV). The main
idea of their approach is that, in the testing phase, an anomalous patch has a higher reconstruction error
than a normal patch if the autoencoder has been trained successfully on normal patches.


2.2. Scoring based methods 


Another category of methods proposed by researchers is based on scoring [6], [21], [22], [30];
the main idea of this approach is to generate an anomaly score that can be used to determine whether or not
a video segment or frame is abnormal. Sultani et al. [30] propose an approach that learns anomalies from
both normal and abnormal videos; they postulate that using only normal data might not be the best way to
detect anomalies. Therefore, to avoid the time-consuming task of annotating anomalous portions of the training
videos, they suggest using weakly labeled training videos to learn anomalies through a deep multiple instance
ranking system [31].

In their approach, the authors learn an anomaly ranking model that automatically predicts high
anomaly scores for anomalous video segments by treating video segments as instances in multiple instance
learning (MIL) and normal and abnormal videos as bags. MIL is a learning paradigm in which the training
data is organized in bags, and each bag contains a collection of instances [32]. Pang et al. [1]
address the problem by learning anomaly scores end to end on a collection of video frames without
explicitly labeling any data as normal or abnormal. To that end, they propose an end-to-end approach based on
self-trained deep ordinal regression to detect anomalies in video. This approach overcomes two
limitations of existing methods: the reliance on manually labeled normal training data and
sub-optimal feature learning.
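
Returning to the MIL ranking model of [30], its core idea can be sketched as a hinge ranking loss between the highest-scoring segment of an anomalous bag and the highest-scoring segment of a normal bag. The sketch below is an illustration under assumptions, not the authors' code: the temporal smoothness and sparsity terms follow the keywords listed for [30] in Table 1, their exact form here is assumed, and the weights `lam_smooth` and `lam_sparse` are hypothetical values.

```python
def mil_ranking_loss(anomalous_bag, normal_bag, lam_smooth=0.01, lam_sparse=0.01):
    """Hinge ranking loss over two bags of per-segment anomaly scores in [0, 1].

    anomalous_bag: scores of the segments of a video labeled anomalous (weak label).
    normal_bag:    scores of the segments of a video labeled normal.
    """
    # The top-scoring anomalous segment should outrank the top-scoring normal segment.
    hinge = max(0.0, 1.0 - max(anomalous_bag) + max(normal_bag))
    # Assumed smoothness term: scores of adjacent segments should vary slowly.
    smooth = sum((a - b) ** 2 for a, b in zip(anomalous_bag, anomalous_bag[1:]))
    # Assumed sparsity term: only a few segments of an anomalous video are anomalous.
    sparse = sum(anomalous_bag)
    return hinge + lam_smooth * smooth + lam_sparse * sparse

# Hinge term only (weights set to zero) for two hypothetical score configurations.
separated = mil_ranking_loss([0.1, 0.9, 0.2], [0.1, 0.2, 0.1], 0.0, 0.0)
confused = mil_ranking_loss([0.2, 0.3, 0.2], [0.4, 0.5, 0.4], 0.0, 0.0)
```

Only video-level labels are needed: the loss never requires knowing which segment of the anomalous video actually contains the anomaly.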

The proposed framework receives a collection of unlabeled videos and
first carries out an initial detection to produce a set of pseudo-anomalous frames and a set of pseudo-normal
frames. These sets are then used to train a ResNet-50 model [33] and a fully connected network in an end-to-end
fashion. ResNet-50 is a pre-trained model used here to extract frame appearance features. The fully connected
network is composed of a hidden layer with 100 units and an output layer with one linear unit. Finally, the anomaly
scores of all frames are recalculated using the trained model. The abnormal and normal memberships are
updated as needed, and the process is repeated.

Another method was proposed by Xu et al. [7], who introduce an unsupervised learning
approach to learn feature representations automatically. They propose a new double fusion architecture that
takes advantage of the complementary information contained in both appearance and motion patterns,
combining the advantages of typical early fusion and late fusion. In the early fusion, stacked
denoising auto-encoders (SDAE) are used to learn both the motion and appearance features of activities in a video
separately. Then, multiple one-class SVM models are employed to predict the anomaly scores of each input
from the learned features. Finally, the late fusion combines the obtained scores and detects anomalous
events. As claimed by the authors, this work is the first effort to tackle the challenge of abnormal event
identification using deep learning. Despite the good results achieved, the approach is still limited by its
high computational cost for real-time processing. Therefore, future work could
investigate ways to reduce the computational cost.
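
The late fusion step described above can be sketched as a weighted combination of per-frame scores coming from several models. This is a simplified illustration, not the procedure of [7]: the min-max normalization, the weights, and the sample scores are all assumptions.

```python
def min_max(scores):
    """Rescale a list of scores to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]

def late_fusion(score_streams, weights):
    """Weighted average of normalized per-frame anomaly scores from several models."""
    normalized = [min_max(stream) for stream in score_streams]
    total = sum(weights)
    n_frames = len(normalized[0])
    return [sum(w * stream[t] for w, stream in zip(weights, normalized)) / total
            for t in range(n_frames)]

# Hypothetical per-frame scores from an appearance model and a motion model.
appearance = [0.1, 0.2, 0.9]
motion = [1.0, 1.5, 4.0]
fused = late_fusion([appearance, motion], weights=[0.5, 0.5])
most_anomalous_frame = fused.index(max(fused))
```

Normalizing each stream first keeps a model with a larger score range from dominating the fused score.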


2.3. Future frame-based methods 

This approach offers another way to address the anomaly detection challenge, through
future frame prediction. The underlying assumption is that normal events are predictable whereas abnormal
ones do not match expectations. The first work to introduce this approach is [22], in which the
authors propose a future frame prediction network. The approach is based on a generator-discriminator
structure similar to that of a GAN: a U-Net model is used as the prediction network to create
a future frame, while the discriminator at the end of the network determines whether or not the predicted
frame is abnormal. Moreover, to predict a higher-quality future frame for normal events, in addition to the
commonly used appearance constraints, they also use a motion constraint that enforces consistency between
the optical flow of the predicted frames and that of the ground truth.

Another method was proposed by Medel and Savakis [34], who also used a future frame
prediction approach. Their approach is based on developing generative models that, with limited supervision,
can detect anomalies in videos. They suggest a composite convolutional long short-term memory
(Conv-LSTM) network that is end-to-end trainable and can predict the evolution of a video sequence, including
future frames, from a few input frames. The network learns to predict ‘normal’ activities that are
comparable to those seen in the training videos. For abnormal activities, the prediction
deviates further from the ground truth with each succeeding timestep. As a result, the regularity score
produced can be used to identify when abnormalities occur in videos. At the evaluation level, the authors did
not use the most common metrics for evaluating results and comparing with other methods, such as AUC and EER.
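
To make the notion of a regularity score concrete, the sketch below converts per-frame prediction errors into scores in [0, 1], where values near 1 indicate regular (normal) frames. The min-max normalization and the 0.5 decision threshold are assumptions chosen for illustration and are not taken from [34]; the error values are hypothetical.

```python
def regularity_scores(errors):
    """Map per-frame prediction errors to regularity scores in [0, 1] (1 = most regular)."""
    lo, hi = min(errors), max(errors)
    if hi == lo:
        return [1.0 for _ in errors]  # no variation: treat every frame as regular
    return [1.0 - (e - lo) / (hi - lo) for e in errors]

# Hypothetical prediction errors for six consecutive frames.
errors = [0.02, 0.03, 0.02, 0.40, 0.55, 0.04]
scores = regularity_scores(errors)

# Frames whose regularity drops below an assumed threshold are treated as suspect.
suspect_frames = [t for t, s in enumerate(scores) if s < 0.5]
```

Because the score is normalized per video, it highlights when the prediction degrades relative to the rest of the sequence rather than giving an absolute measure.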


2.4. Classifier based methods 

The work of [4] framed the anomaly detection problem as a classification
problem. The authors proposed an approach for detecting and locating anomalies in videos by analyzing the output
of deep layers; the approach uses fully convolutional neural networks (FCNNs) and temporal information.
The proposed FCN combines a pre-trained CNN based on the AlexNet model [9] with a novel convolutional
layer whose kernels are trained with regard to the training video. The network focuses on two key tasks: outlier
detection and feature representation. This approach achieved good results in terms of accuracy, but it still has
some limitations: it produces false positives in some cases, such as when people walk in different directions and
in crowded scenes. A summary of the past literature on anomaly detection techniques is shown in Table 1.

Table 1. Deep learning-based approaches for anomaly detection

Year | Learning type | Approach | Contribution | Used techniques | Ref
-----|---------------|----------|--------------|-----------------|----
2020 | Weakly supervised | Score | Self-trained deep ordinal regression for end-to-end video anomaly detection | ResNet50, end-to-end, self-training, ordinal regression | [1]
2018 | Weakly supervised | Score | Using MIL to predict high anomaly scores for anomalous video segments | MIL, sparsity, temporal smoothness | [30]
2016 | Limited supervision | Reconstruction error | Learn normal patterns using autoencoders with limited supervision | Fully convolutional autoencoder | [23]
2017 | Unsupervised | Reconstruction error | Train a GAN to learn an internal representation of scene normality using normal frames and related optical-flow images | GAN, optical flow | [24]
2021 | Unsupervised | Reconstruction error | Multi-channel framework based on 4 CGANs to generate multi-type frame-level features | CGAN, SVM, Full Flow, Mask R-CNN | [28]
2018 | Unsupervised | Future frame | Future frame prediction network for anomaly detection | GAN, U-Net, optical flow | [22]
2016 | Limited supervision | Future frame | End-to-end trainable composite Conv-LSTM networks | Conv-LSTM | [35]
2020 | Unsupervised | Reconstruction error | Abnormal event detection using GAN and transfer learning | GAN, transfer learning, pre-trained CNN (VGG16), optical flow | [27]
2018 | Unsupervised | Classification | FCN: the combination of a pre-trained CNN (AlexNet) and a novel convolutional layer | FCNN, AlexNet, Gaussian classifier, sparse auto-encoder (SAE) | [4]
2017 | Unsupervised | Score | AMDN: unsupervised learning approach based on deep learning architectures | SDAE, multiple one-class SVM models, fine-tuning | [7]
2016 | Unsupervised | Reconstruction error | Two cubic patch approach based on AE and sparse representation | AE, SAE | [21]


3. BENCHMARK DATASETS 

In this part, we describe the public datasets used for anomaly detection tasks in video. Most
of the papers use at least one benchmark dataset to compare the performance of their proposed
methods with previously published work. Because of the variable crowd density and behavior patterns, all datasets
exhibit dynamic scenarios. The datasets frequently used for anomaly detection are listed
in Table 2.


Table 2. A comparison of anomaly datasets

Dataset | Number of videos | Number of frames | Average frames | Training video | Testing video | Anomalous events | Resolution | Dataset length | Number of scenes | Examples of anomalies
--------|------------------|------------------|----------------|----------------|---------------|------------------|------------|----------------|------------------|----------------------
Subway entrance | 1 | 144,249 | 144,249 | 15 min | - | 66 | 512x384 | 1.5 hours | 1 | No payment, loitering, wrong way
Subway exit | 1 | 64,900 | 64,900 | 15 min | - | 19 | 512x384 | 43 min | 1 | Wrong direction, loitering
UMN | 11 | ~7,700 | 1,290 | - | - | 11 | 320x240 | 5 min | 3 | Run
UCSD Ped1 [36] | 50 | 14,000 | 200 | 34 | 16 | 40 | 238x158 | 5 min | 1 | Small cars, skaters, walking in the grass
UCSD Ped2 [37] | 70 | 4,560 | 163 | 16 | 12 | 12 | 360x240 | 5 min | 1 | Skaters, small cars, bikers
CUHK avenue [38] | 28 | 35,240 | 2,120 | 16 | 21 | 14 | 640x360 | 30 min | 1 | Running, throwing objects, and loitering
Street scene | 15 | 203,257 | - | 46 | 35 | 205 | 1280x720 | - | 1 | -
ShanghaiTech | 437 | 317,398 | - | 330 | 107 | 130 | 856x480 | - | 13 | -
UCF-crime [30] | 1,900 | ~13.8M | 4,052 | 1,610 | 290 | 13 | 320x240 | 128 hours | n | Burglary, fights, robbery, accidents on the road

3.1. UCSD pedestrian

The UCSD pedestrian dataset [37] contains 2 subsets, the UCSD Ped1 dataset and the UCSD Ped2
dataset, which differ in frame size and camera angle. The dataset is divided into
training and testing data. The training data is free of abnormalities: it shows only normal activities and contains
only pedestrians. In contrast, there is at least one anomaly in every testing clip. The anomalous events are either
non-pedestrian entities moving along the pathways or anomalous pedestrian motion. Common anomalies include small cars,
skaters, bikes, and people walking on the grass. In certain frames, anomalies appear in multiple locations.

UCSD pedestrian 1: this dataset has 34 video sequences for training and 16 video sequences for
testing, in which one or more anomalies are present in some of the frames. Pixel-level binary masks are provided
for a collection of ten clips in the testing set to identify the regions containing anomalous events. Each clip contains
about 200 frames. There are 5,500 normal and 3,400 abnormal frames, with a resolution of 158x238 pixels. In
this dataset, the camera is positioned at a considerable height.

UCSD pedestrian 2: this dataset contains around 1,652 anomalous and 346 normal frames across 12
testing and 16 training video sequences. The frames have a resolution of 360x240 pixels. The camera here is
placed at a lower height. Each testing clip in this dataset has only one anomalous event, which takes up the
majority of the video segment.

Different works are usually evaluated independently on these two datasets. Because of the different
camera viewpoints, Ped1 appears to be more challenging than Ped2. Figure 2 shows sample frames from the
UCSD dataset for both normal and abnormal behavior in the scene, together with their ground truth.


Figure 2. Samples from the UCSD dataset (rows: UCSD Ped1 and UCSD Ped2); the left column illustrates normal
pedestrian behavior from the training set, the middle column shows abnormal behavior from the test set, and the
right column shows the corresponding ground truth


3.2. CUHK avenue 

The Chinese University of Hong Kong (CUHK) avenue dataset [39] includes 16 video sequences for
training and 21 video sequences for testing, each around 2 minutes long. There are a total of 47
anomalous events, such as throwing objects, running, loitering, and going the wrong way. Because of the camera
position and viewpoint, people's sizes may vary. The training videos contain mostly normal events,
although a few abnormal situations are present. The number of normal samples in the test set is greater than
the number of abnormal samples. Figure 3 illustrates sample frames of abnormal behavior from the CUHK
avenue dataset.

Figure 3. Samples of abnormal behavior from the CUHK avenue dataset


3.3. Subway dataset

The subway dataset [40] comprises 2 video sequences recorded at the entrance (144,249 frames,
1 hour 36 minutes long) and at the exit (64,900 frames, 43 minutes long) of a subway station. The abnormal
events mainly include individuals traveling in the wrong direction and no-payment events. The number of
anomalies in this dataset is low. Figure 4 shows sample frames from the subway dataset for both normal
and abnormal events.

Subway entrance: the surveillance video from the subway entrance shows a variety of
anomalous events, such as people loitering, walking in the wrong direction, and avoiding payment.

Subway exit: anomalies similar to those in the subway entrance video can be seen in the surveillance video of the
subway exit.

Figure 4. Samples from the subway dataset; the top row displays regular events, whereas the bottom row
displays abnormal ones


3.4. UMN dataset 

The University of Minnesota (UMN) dataset comprises 3 distinct scenes of escape incidents, with a
total of 7,740 frames (1,450 for scene 1, 4,145 for scene 2, and 2,145 for scene 3) and a resolution
of 320x240. The abnormal activities are people scattering and running at the same moment, while the normal events
are pedestrians wandering aimlessly around the plaza or through the mall. There are 11 abnormal events in
the entire video collection. Figure 5 illustrates example frames from the UMN dataset.

Figure 5. Samples from the UMN dataset; the top row depicts normal crowd behavior, while the bottom row
depicts panicked crowd behavior




Bulletin of Electr Eng & Inf, Vol. 12, No. 1, February 2023: 314-327 


Bulletin of Electr Eng & Inf ISSN: 2302-9285


3.5. ShanghaiTech dataset 

The ShanghaiTech dataset includes 330 videos for training and 107 videos for testing, with over
270,000 training frames. It contains 130 abnormal events covering numerous types of anomalies across 13 scenes
with difficult lighting conditions and camera positions. Furthermore, the ground truth of the abnormal events is
labeled. In the test set, normal samples outnumber abnormal samples. Figure 6 shows sample frames from
this dataset for both abnormal and normal behavior.


Figure 6. Normal (left) and abnormal (right) frames from the ShanghaiTech dataset; the red box denotes an
anomaly in an anomalous frame


3.6. UCF dataset 

The University of Central Florida (UCF) dataset is a sizeable dataset proposed by [30] to help solve
the anomaly detection problem, with about 128 hours of video. It contains 1,900 long, real-world surveillance
videos covering 13 realistic anomalies, including burglary, fights, robbery, and accidents on the road, as well as
normal activities. This dataset can be used for two different purposes: first, for general anomaly detection, where
all anomalies are considered as one group and all normal events as another; second, for recognizing each
of the 13 anomalous activities. There are 15 times as many videos in this dataset as there are in other
datasets. Figure 7 shows a few examples of anomalies from the UCF dataset.


Figure 7. Samples of anomalies from the UCF dataset (for example, stealing)


4. EVALUATION METRICS 
In this section, we will discuss the evaluation and comparison measures used in state-of-the-art 
methods. 
- Frame level: a frame is counted as a detection if it contains at least one abnormal pixel. These detections
are compared to each frame's ground truth annotation. The process is carried out several times for
different thresholds to create a ROC curve. This assessment does not confirm that the detection
corresponds to the actual location of the anomaly. Therefore, some true positive detections may be the
result of "lucky" co-occurrences of false positives and abnormal events [37].

- Pixel level: the accuracy of localization is evaluated by comparing detections to pixel-level ground truth
masks on a collection of ten clips. The process is comparable to the frame-level one. A frame is
deemed accurately detected if at least 40% of the truly anomalous pixels are found; otherwise, it is
tallied as a false positive [37].

- ROC curve: to evaluate the accuracy for various threshold settings, the ROC curve is employed. The 
ROC is composed of false positive rate (FPR) and true positive rate (TPR), where FPR determines the 
proportion of false-positive findings that occur as compared to the total number of negative samples 
available through the test stage, and TPR defines a classifier test performance on accurately categorizing 
positive instances among all available positive samples throughout the test stage. These measurements are 
provided by (1) and (2): 


$$\mathrm{TPR} = \frac{\text{True positive}}{\text{False negative} + \text{True positive}} \tag{1}$$

$$\mathrm{FPR} = \frac{\text{False positive}}{\text{True negative} + \text{False positive}} \tag{2}$$


where true positive (TP) denotes the anomalous events that have been correctly identified as anomalous; true
negative (TN) denotes the normal events that have been correctly identified as normal; false positive (FP) denotes
the normal events that have been wrongly identified as anomalous; and false negative (FN) denotes the anomalous
events that have been wrongly identified as normal. We select several thresholds for both frame-level and
pixel-level detection and compute the TPR and FPR accordingly to produce the ROC curve [39].
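
The frame-level and pixel-level criteria and equations (1) and (2) can be sketched as follows. This is a minimal illustration: masks are assumed to be flat lists of 0/1 values (1 = anomalous pixel), and the function names are hypothetical.

```python
def frame_level_detected(pred_mask):
    """Frame-level rule: the frame is a detection if at least one pixel is flagged."""
    return any(pred_mask)

def pixel_level_hit(pred_mask, gt_mask, min_overlap=0.4):
    """Pixel-level rule: at least 40% of the truly anomalous pixels must be flagged."""
    gt_total = sum(gt_mask)
    if gt_total == 0:
        return False  # no anomalous pixel in the ground truth, so nothing to hit
    covered = sum(1 for p, g in zip(pred_mask, gt_mask) if p and g)
    return covered / gt_total >= min_overlap

def rates(predictions, labels):
    """Return (TPR, FPR) as in equations (1) and (2); 1 = anomalous, 0 = normal."""
    tp = sum(1 for p, y in zip(predictions, labels) if p and y)
    fp = sum(1 for p, y in zip(predictions, labels) if p and not y)
    fn = sum(1 for p, y in zip(predictions, labels) if not p and y)
    tn = sum(1 for p, y in zip(predictions, labels) if not p and not y)
    tpr = tp / (tp + fn) if tp + fn else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0
    return tpr, fpr

gt = [0, 1, 1, 1, 1, 1, 0, 0]
lucky = [1, 1, 0, 0, 0, 0, 0, 1]   # fires, but mostly in the wrong place
hit_frame = frame_level_detected(lucky)   # counted as a detection at frame level
hit_pixel = pixel_level_hit(lucky, gt)    # rejected at pixel level (1 of 5 pixels covered)
tpr, fpr = rates([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])
```

The `lucky` mask shows why the pixel-level criterion is stricter: the same prediction is accepted at frame level but rejected once localization is checked.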

The AUC is employed as the evaluation metric. The ground truth and frame-level anomaly scores 
are used to calculate AUC. Figure 8 illustrates the area under the ROC curve. 

The EER is the proportion of incorrectly categorized frames at the operating point where the FPR and the miss
rate are equal. The lower the EER value, the higher the accuracy of the algorithm. The EER corresponds to the
point of the ROC at the intersection of the curve and the line going from (0, 1) to (1, 0). Figure 8 illustrates
the EER. Time complexity is another important criterion: if an algorithm's overall execution time is sufficiently
short, it is more appealing for use in many applications.
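
The following sketch shows one way the ROC curve, the AUC, and the EER could be computed from frame-level anomaly scores. It rests on stated assumptions: higher scores mean "more anomalous", both classes are present in the labels, the AUC is obtained with the trapezoidal rule, and the EER is approximated by the ROC point where the FPR is closest to the miss rate (1 - TPR) rather than by interpolation. The scores and labels are hypothetical.

```python
def roc_points(scores, labels):
    """ROC points (FPR, TPR) obtained by sweeping a threshold over the scores."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    for th in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= th and y)
        fp = sum(1 for s, y in zip(scores, labels) if s >= th and not y)
        points.append((fp / negatives, tp / positives))
    return points

def auc(points):
    """Area under the ROC curve using the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

def eer(points):
    """Approximate EER: ROC point where FPR is closest to the miss rate (1 - TPR)."""
    fpr, tpr = min(points, key=lambda p: abs(p[0] - (1.0 - p[1])))
    return (fpr + (1.0 - tpr)) / 2

# Hypothetical frame-level anomaly scores and ground truth (1 = anomalous frame).
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
labels = [1, 1, 0, 1, 0, 1, 0, 0]
points = roc_points(scores, labels)
area = auc(points)
equal_error = eer(points)
```

A finer threshold sweep or linear interpolation between neighboring ROC points would give a more precise EER on real data.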


Figure 8. EER and ROC curve [27] (the plot shows the ROC curve and the EER line; vertical axis: true positive
rate (TPR))


5. COMPARISON AND DISCUSSION 

In this section, we discuss and analyze the performance of anomaly detection methods in video
sequences, specifically those based on deep learning approaches. Table 3 (in Appendix) lists the approaches
discussed in the previous sections, along with other papers that tackle the anomaly detection problem, according
to the publicly available datasets. These approaches are grouped by the type of learning used, together with the
evaluation metric results obtained by applying the anomaly detection methods to different datasets. The
accuracy of the different methods is compared using their frame-level and pixel-level scores.

We have classified the deep learning-based papers into four categories of approaches: reconstruction
error, future frame prediction, scoring, and classifiers. The accuracy of each one is tested on several
datasets and evaluated using the AUC and EER metrics at both the frame and pixel levels. As shown in Table 3
(in Appendix), the deep learning-based methods achieve good results on most of the available datasets
compared to hand-crafted feature-based methods, with some exceptions on specific datasets: the method of [40],
which has the lowest EER value (10%) of all methods on the UCSD Ped2 dataset, the method of
[37] on the subway entrance dataset, and the method of [41], which achieved an accuracy of 99.70% on the UMN
dataset.

The analysis of the results of the deep learning-based methods shows that reconstruction error
is the most used approach and gives superior accuracy on the UCSD datasets at both the frame and pixel levels, as
shown in [23], [24], [28]. In some situations, however, larger reconstruction errors for anomalous events may
not occur because of the high capacity of the deep neural network. The scoring approach has also
achieved good results on some other datasets, as in [42], especially on the subway exit (AUC=95.1%) and
UMN (AUC=99.83%) datasets. In addition, the approach presented in [30] has given good accuracy
(AUC=75.41%) on the authors' own UCF dataset compared to the results of other approaches, but it could not
locate the anomaly exactly in some situations.

For the classifier approach, the method presented in [43] achieves good accuracy (AUC=97.80%) on 
the UCSD Ped2 dataset, but it generates a high rate of false positives on the UCSD Ped1 dataset 
(AUC=68.4%) in two situations: when people walk in the wrong direction and in crowded scenes. The future 
frame prediction approach proves effective for anomaly detection on some datasets (AUC=95.4% on UCSD 
Ped2). On the Avenue dataset, however, it fails to detect several anomalous jogging events that occur in the 
background, because it could not differentiate jogging from walking pedestrians. In general, some datasets 
are more challenging than others. For example, all approaches give good results on the UMN dataset, due to 
its simplicity, whereas on the UCF dataset the highest result obtained is AUC=75.41%. 

Based on the reviewed papers and the results in Table 3 (in Appendix) 
[1], [4], [7], [21]-[24], [27], [28], [30], [35]-[37], [40]-[60], it appears clearly that several studies choose to 
tackle the anomaly detection problem using unsupervised learning methods, because these do not require 
labeled video data and can be effectively employed for learning good representations. They are also suited to 
the complexity and variety of anomalous visual behaviors in an unconstrained environment. However, they 
are still limited and have not achieved good results. Other researchers therefore try to overcome this limit 
with semi-supervised learning methods, which use only data from the "normal" class. These methods are thus 
more specific to the anomaly detection problem than unsupervised methods, which use only the structure and 
configuration of the unlabeled data and no other information. 

Despite the large body of research on this topic, some limits remain. Many anomaly-detection 
algorithms work with very regular scenes, so it is necessary to evaluate how well these methods operate in 
less structured situations. Real-time application in unconstrained environments and time complexity also 
remain open issues. Therefore, we propose to use the vision transformer model [61], a recent deep learning 
technique that achieves good results on many problems and could be a good approach for the anomaly 
detection problem. 


6. CONCLUSION 

This paper reviews deep learning-based methods for video anomaly detection, covering a variety 
of approaches, techniques, datasets, and evaluation metrics. A thorough overview of anomaly detection 
should ideally enable readers not only to understand the rationale for using a specific technique, but also to 
compare different techniques, produce a comparative analysis, and propose an approach. First, we 
classified the approaches into four categories: reconstruction errors, future frame prediction, scoring, and 
classifiers. We also presented the strengths and weaknesses of each category on several datasets. Each 
category can be applied in a supervised or unsupervised manner, but most researchers have focused on 
tackling the anomaly detection problem with unsupervised learning. 

Furthermore, we have presented the publicly available datasets with their details, such as the video 
resolution and example anomalies found in each dataset, and we found that some datasets are more 
challenging than others. Finally, we have discussed the results of the categories applied to different 
datasets. To tackle the remaining problems and achieve good results in both accuracy and computational 
complexity, there are research opportunities to develop a new approach based on the vision transformer to 
improve the detection of anomalous objects in video sequences. 
APPENDIX 
Table 3. Results of the different approaches on several datasets 
Dataset UCSD Subway CUHK Avenue UMN SHT UCF 
Ped 1 Ped 2 Entrance Exit All Scenes 
Frame-level Pixel-level Frame-level Frame Frame-level Frame-level Frame-level Pixel-level Frame-level Frame Frame 
Approach level level level 
Categories Ref EER AUC EER AUC EER AUC EER EER AUC EER AUC EER AUC EER AUC EER AUC AUC AUC 
44 50,30% 63,00% 70,70% 86,8% 76,50% 
40 10% 17% 
45 56,30% 67,50% 80,50% 91% 87,10% 
46] 40% 59,00% 81% 20,50% 30% 69,30% 71% 
47] 31% 67,50% 79% 19,70% 42% 55,60% 80% 96,00% 
Hand-crafted [37] 32% 68,80% 71% 21,30% 36% 61,30% 72% 
based methods 
37] 25% 81,80% 58% 44,10% 25% 82,90% 54% 16,70% 90,80% 16,40% 89,7% 
36] 15% 91,80% 43% 63,80% - 2 65,51% 
48] 19% 54% 45,30% 20% 24,40% 83,30% 26,40% 80,2% 97,80% 
41 87,00% 91,00% 99,70% 
35 19% 29,90% 
53 94,10% 